9 research outputs found

Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation

Author: Nguyen Thai-Son
Niehues Jan
Stueker Sebastian
Waibel Alex
Publication date: 01/01/2020

Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models, overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhibited by the training data with the help of data augmentation. In this paper we examine the influence of three data augmentation methods on the performance of two S2S model architectures. One of the data augmentation methods comes from the literature, while the two other methods are our own development: a time perturbation in the frequency domain and sub-sequence sampling. Our experiments on Switchboard and Fisher data show state-of-the-art performance for S2S models that are trained solely on the speech training data and do not use additional text data.

Comment: To appear in ICASSP 202

arXiv.org e-Print Archive

Maastricht University Research Portal

Crossref

Relative Positional Encoding for Speech Recognition and Direct Translation

Author: Ha Thanh-Le
Nguyen Thai-Son
Nguyen Tuan-Nam
Niehues Jan
Pham Ngoc-Quan
Salesky Elizabeth
Stueker Sebastian
Waibel Alexander
Publication date: 01/01/2020

Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition is relative distance between input states in the self-attention network. As a result, the network can better adapt to the variable distributions present in speech data. Our experiments show that our resulting model achieves the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result in the MuST-C speech translation benchmark. We also show that this model is able to better utilize synthetic data than the Transformer, and adapts better to variable sentence segmentation quality for speech translation.

Comment: Submitted to Interspeech 202

arXiv.org e-Print Archive

Maastricht University Research Portal

Crossref

Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop

Author: Arthur Philip
Besacier Laurent
Black Alan
Ciannella Francesco
Du Mingxing
Dupoux Emmanuel
Godard Pierre
Hasegawa-Johnson Mark
Larsen Elin
Merkx Danny
Metze Florian
Mueller Markus
Neubig Graham
Ondel Lucas
Palaskar Shruti
Riad Rachid
Scharenborg Odette
Stueker Sebastian
Wang Liming
Publication date: 14/02/2018

We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.

Comment: Accepted to ICASSP 201

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Speech technology for unwritten languages

Author: Besacier Laurent
Black Alan W.
Godard Pierre
Hasegawa-Johnson Mark
Metze Florian
Mueller M
Neubig Graham
Scharenborg O.E.
Stueker Sebastian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020

Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this article takes the first steps towards speech technology for unwritten languages. Specifically, the aim of this work was 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test the sufficiency of the learned representations to regenerate speech or translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible.

Multimedia Computin

TU Delft Repository

Computational language documentation: some results from the BULB project

Author: Adda Gilles
Adda-Decker Martine
Ambouroue Odette
Besacier Laurent
Blachon David
Bonneau-Maynard Hélène
Gauthier Elodie
Godard Pierre
Hamlaoui Fatima
Idiatov Dmitry
Kouarata Guy-Noël
Lamel Lori
Makasso Emmanuel-Moselly
Mariani Joseph
Rialland Annie
Stueker Sebastian
Van De Velde Mark
Yvon François
Zerbian Sabine
Publication venue: HAL CCSD

International audience